An Analysis of Wildfires in British Columbia, 2012 to 2024

DATA6200, Assignment 2

Author

Jason Ives

1. Overview

In this project I will analyze data on wildfires in British Columbia (B.C.), Canada, from 2012 to 2024. The analysis examines wildfire size, frequency, distribution, and causes using data provided by the Government of British Columbia, both for analysis and validation. It will produce a series of visualizations assessing how the size, frequency, and distribution of wildfires have changed over time, and how the nature of wildfires is affected by their cause.

2. Data Cleaning

2.1. Workspace configuration

Before accessing the data, I will configure my workspace by loading the necessary libraries and declaring my custom functions.

Code
##UTILITY FUNCTION REFERENCE
#str(df) - view df structure
#sessionInfo() - view loaded packages
#rm(list = ls()) - clear environment
#sum(is.na(col)) - count NA / NaN values in a column
#setwd() - set working directory

##FILES TO INCLUDE IN FINAL PACKAGE

gc()

##ADD LIBRARIES
library(sf)
library(tidyverse)
library(httr)
library(xml2)
library(rvest)
library(tmap)
library(leaflet)
library(janitor)
library(osmdata)

##DECLARE CUSTOM FUNCTIONS
##CONVENIENCE FUNCTION FOR VIEWING DISTINCT VALUES
Show_Distinct <- function(df, param) {
  df |> 
    distinct({{param}})
}
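
As a quick illustration with a made-up data frame (the function is repeated here so the example is self-contained), Show_Distinct returns one row per unique value of the chosen column:

```r
library(dplyr)

## Curly-curly ({{ }}) forwards the unquoted column name into distinct()
Show_Distinct <- function(df, param) {
  df |> 
    distinct({{param}})
}

fires <- tibble(cause = c("HUMAN", "LIGHTNING", "HUMAN", "LIGHTNING"))
Show_Distinct(fires, cause)
## 2 rows: HUMAN, LIGHTNING
```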

##IMPORT FILES IN FILENAMES VECTOR, CONCATENATE, RETURN AS SF OBJECT
##FOR FILES 2+:  IMPORT, CONFIRM CRS MATCH, BIND ROWS
##NOTE: THIS DIDN'T WORK IN PRACTICE - ST_READ COULDN'T SEE VALUES IN SOME LAYERS, SO I IMPORT MANUALLY BELOW
Import_Multi_Kmz <- function(filenames) {
  i <- 1
  for (f in filenames) {
    unzip(f, exdir = "extracted_kml")
    layers <- st_layers("extracted_kml/doc.kml", do_count = TRUE)
      for (l in layers$layer_name) {
        sf_data <- st_read("extracted_kml/doc.kml", layer = l)
    
        if (i == 1) {
          kmz_data <- sf_data
        } else if (st_crs(kmz_data) == st_crs(sf_data)) {
          kmz_data <- bind_rows(kmz_data, sf_data)
        } else {
          print("CRS mismatch, ending data load")
          break
        }
        i <- i + 1
      }
  }
  return(kmz_data)
}

##PARSE HTML DESCRIPTION COLUMN
Parse_Description <- function(to_parse){
    vals <- read_html(to_parse) |> 
      xml_find_all("//td") |> 
      html_text()
    
    ##CAPTURE DATA IN ELEMENTS 3 TO END
    vals <- vals[3:length(vals)]
    
    ##SPLIT LIST OF CAPTURED ELEMENTS INTO NAMES AND VALUES
    name_nodes <- seq_along(vals) %% 2 == 1
    val_nodes <- seq_along(vals) %% 2 == 0
    parsed_names <- vals[name_nodes]
    parsed_vals <- vals[val_nodes]
    
    ##COMBINE SPLIT ITEMS INTO A NAMED VECTOR
    names(parsed_vals) <- parsed_names

    ##RETURN NAMED VECTOR AS DF
    return(data.frame(as.list(parsed_vals)))
}
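
For example, on a synthetic description block (two header cells followed by name/value pairs, matching the structure the function assumes; the function is repeated so the example runs on its own), Parse_Description yields a one-row data frame:

```r
library(xml2)
library(rvest)

## Repeated from above so the example is self-contained
Parse_Description <- function(to_parse){
  vals <- read_html(to_parse) |> 
    xml_find_all("//td") |> 
    html_text()
  vals <- vals[3:length(vals)]                    ##DROP THE TWO HEADER CELLS
  parsed_names <- vals[seq_along(vals) %% 2 == 1]
  parsed_vals <- vals[seq_along(vals) %% 2 == 0]
  names(parsed_vals) <- parsed_names
  data.frame(as.list(parsed_vals))
}

## HYPOTHETICAL description; real values come from the kmz files
desc <- paste0("<table>",
               "<tr><td>Attribute</td><td>Value</td></tr>",
               "<tr><td>FIRE_NUMBER</td><td>G12345</td></tr>",
               "<tr><td>CURRENT_SIZE</td><td>10.5</td></tr>",
               "</table>")
Parse_Description(desc)
##   FIRE_NUMBER CURRENT_SIZE
## 1      G12345         10.5
```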

2.2. Downloading and importing the .kmz point files

The primary data set I will use for my analysis is a series of .kmz geographic data files. I will unpack those .kmz files and join their data elements into a single sf object. I will also import a secondary data set using web scraping, which I will use to corroborate the primary data. This will help ensure the validity of the data used in the analysis.


It was noted on 11/6 in the class announcements that the data in the FTP archive might be more complete. I have considered this and explored the archived data, and decided against using it. It was likely archived for a reason, and without specific insight into why, I’m not prepared to declare it “better” data for this analysis based on quantity alone. Data cleaning and refinement often involve dropping or editing data points and cases, actions which improve the quality of the data by reducing its amount. With that in mind, I will continue my analysis using the original data.

Code
##MANUALLY IMPORT ALL DATA AND RECORD YEAR FILE FOR ERROR CORRECTION
##2013 - 2015
unzip('BC Fire Points 2013-2015.kmz', exdir = "extracted_kml")
bc_fire_data <- st_read("extracted_kml/doc.kml", layer = "BC Fire Points 2013") |> 
  mutate(data_year = 2013)
bc_fire_data <- bind_rows(bc_fire_data, st_read("extracted_kml/doc.kml", layer = "BC Fire Points 2014")) |> 
  mutate(data_year = case_when(
    is.na(data_year) ~ 2014, 
    .default = data_year
  ))
bc_fire_data <- bind_rows(bc_fire_data, st_read("extracted_kml/doc.kml", layer = "BC Fire Points 2015")) |> 
    mutate(data_year = case_when(
    is.na(data_year) ~ 2015, 
    .default = data_year
  ))

##2015 - 2017
unzip('BC Fire Points 2015-2017.kmz', exdir = "extracted_kml")
bc_fire_data <- bind_rows(bc_fire_data, st_read("extracted_kml/doc.kml", layer = "BC Fire Points 2015")) |> 
  mutate(data_year = case_when(
    is.na(data_year) ~ 2015, 
    .default = data_year
  ))
bc_fire_data <- bind_rows(bc_fire_data, st_read("extracted_kml/doc.kml", layer = "BC Fire Points 2016")) |> 
  mutate(data_year = case_when(
    is.na(data_year) ~ 2016, 
    .default = data_year
  ))
bc_fire_data <- bind_rows(bc_fire_data, st_read("extracted_kml/doc.kml", layer = "BC Fire Points 2017")) |> 
  mutate(data_year = case_when(
    is.na(data_year) ~ 2017, 
    .default = data_year
  ))

##2017 - 2019
##NOTE LAYER NAMING SWITCHES TO "POINT" HERE
##NOTE - SPACE AT END OF NAME FOR 2017 AND 2018 LAYERS
unzip('BC Fire Points 2017-2019.kmz', exdir = "extracted_kml")
bc_fire_data <- bind_rows(bc_fire_data, st_read("extracted_kml/doc.kml", layer = "BC Fire Point 2017 ")) |> 
  mutate(data_year = case_when(
    is.na(data_year) ~ 2017, 
    .default = data_year
  ))
bc_fire_data <- bind_rows(bc_fire_data, st_read("extracted_kml/doc.kml", layer = "BC Fire Point 2018 ")) |> 
  mutate(data_year = case_when(
    is.na(data_year) ~ 2018, 
    .default = data_year
  ))
bc_fire_data <- bind_rows(bc_fire_data, st_read("extracted_kml/doc.kml", layer = "BC Fire Point 2019")) |> 
  mutate(data_year = case_when(
    is.na(data_year) ~ 2019, 
    .default = data_year
  ))

##2018 - 2020
unzip('BC Fire Points 2018-2020.kmz', exdir = "extracted_kml")
bc_fire_data <- bind_rows(bc_fire_data, st_read("extracted_kml/doc.kml", layer = "BC Fire Point 2018")) |> 
  mutate(data_year = case_when(
    is.na(data_year) ~ 2018, 
    .default = data_year
  ))
bc_fire_data <- bind_rows(bc_fire_data, st_read("extracted_kml/doc.kml", layer = "BC Fire Point 2019")) |> 
  mutate(data_year = case_when(
    is.na(data_year) ~ 2019, 
    .default = data_year
  ))
bc_fire_data <- bind_rows(bc_fire_data, st_read("extracted_kml/doc.kml", layer = "BC Fire Point 2020")) |> 
  mutate(data_year = case_when(
    is.na(data_year) ~ 2020, 
    .default = data_year
  ))

##2019 - 2021
unzip('BC Fire Points 2019-2021.kmz', exdir = "extracted_kml")
bc_fire_data <- bind_rows(bc_fire_data, st_read("extracted_kml/doc.kml", layer = "BC Fire Point 2019")) |> 
  mutate(data_year = case_when(
    is.na(data_year) ~ 2019, 
    .default = data_year
  ))
bc_fire_data <- bind_rows(bc_fire_data, st_read("extracted_kml/doc.kml", layer = "BC Fire Point 2020")) |> 
  mutate(data_year = case_when(
    is.na(data_year) ~ 2020, 
    .default = data_year
  ))
bc_fire_data <- bind_rows(bc_fire_data, st_read("extracted_kml/doc.kml", layer = "BC Fire Point 2021")) |> 
  mutate(data_year = case_when(
    is.na(data_year) ~ 2021, 
    .default = data_year
  ))

##2020 - 2022
unzip('BC Fire Points 2020-2022.kmz', exdir = "extracted_kml")
bc_fire_data <- bind_rows(bc_fire_data, st_read("extracted_kml/doc.kml", layer = "BC Fire Point 2020")) |> 
  mutate(data_year = case_when(
    is.na(data_year) ~ 2020, 
    .default = data_year
  ))
bc_fire_data <- bind_rows(bc_fire_data, st_read("extracted_kml/doc.kml", layer = "BC Fire Point 2021")) |> 
  mutate(data_year = case_when(
    is.na(data_year) ~ 2021, 
    .default = data_year
  ))
bc_fire_data <- bind_rows(bc_fire_data, st_read("extracted_kml/doc.kml", layer = "BC Fire Point 2022")) |> 
  mutate(data_year = case_when(
    is.na(data_year) ~ 2022, 
    .default = data_year
  ))

##2021 - 2023
unzip('BC Fire Points 2021-2023.kmz', exdir = "extracted_kml")
bc_fire_data <- bind_rows(bc_fire_data, st_read("extracted_kml/doc.kml", layer = "BC Fire Point 2021")) |> 
  mutate(data_year = case_when(
    is.na(data_year) ~ 2021, 
    .default = data_year
  ))
bc_fire_data <- bind_rows(bc_fire_data, st_read("extracted_kml/doc.kml", layer = "BC Fire Point 2022")) |> 
  mutate(data_year = case_when(
    is.na(data_year) ~ 2022, 
    .default = data_year
  ))
bc_fire_data <- bind_rows(bc_fire_data, st_read("extracted_kml/doc.kml", layer = "BC Fire Point 2023")) |> 
  mutate(data_year = case_when(
    is.na(data_year) ~ 2023, 
    .default = data_year
  ))

##2024
unzip('BC Fire Points 2024.kmz', exdir = "extracted_kml")
bc_fire_data <- bind_rows(bc_fire_data, st_read("extracted_kml/doc.kml", layer = "BC Fire Point 2024")) |> 
  mutate(data_year = case_when(
    is.na(data_year) ~ 2024, 
    .default = data_year
  ))

##TIDY NAMES
bc_fire_data <- bc_fire_data |> 
  clean_names()
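
The eight manual import blocks above all follow the same pattern: unzip, read a year's layer, tag the rows with data_year, and bind. A loop-driven sketch could condense them; the file and layer names in the usage comment are taken from the blocks above, and the helper name is my own:

```r
library(sf)
library(dplyr)
library(purrr)

## Import every listed layer from one .kmz, tagging rows with their year.
## layers_by_year is a character vector of layer names, named by year.
import_kmz_layers <- function(kmz_file, layers_by_year) {
  unzip(kmz_file, exdir = "extracted_kml")
  imap(layers_by_year, \(layer_name, yr) {
    st_read("extracted_kml/doc.kml", layer = layer_name, quiet = TRUE) |> 
      mutate(data_year = as.numeric(yr))
  }) |> 
    bind_rows()
}

## e.g. (layer names per the 2013-2015 file; later files use "BC Fire Point ..."):
## bc_fire_data <- import_kmz_layers("BC Fire Points 2013-2015.kmz",
##   c("2013" = "BC Fire Points 2013",
##     "2014" = "BC Fire Points 2014",
##     "2015" = "BC Fire Points 2015"))
```

As in the manual version, overlapping years would still be imported more than once and removed by the de-duplication step that follows.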

2.3. Remove duplicates and parse XML description column

The kml layers covered overlapping year ranges, so I expect duplicate records; I will remove exact duplicates now. Later I will also check for records that represent the same observation but differ slightly, but this is an easy first step to get the ball rolling.

The resulting fire data contains three columns: Name, geometry, and an HTML block called Description. This block holds the bulk of the attribute data for each fire. For this analysis, I will need to parse that data into a format usable for analysis.

Code
##ADD INDEX COLUMN
bc_fire_data <- bc_fire_data |> 
  distinct() |> 
  mutate(key = row_number()) |> 
  relocate(key)

##PARSE XML, DROPPING BLANK DESCRIPTIONS FROM PARSING FIRST
parse_cases <- bc_fire_data |> 
  filter(nchar(description) > 0)

parsed_desc <- map_dfr(parse_cases$description, Parse_Description)
##ADD KEY ONTO PARSED DATA FOR JOINING
parsed_desc <- parsed_desc |> 
  mutate(key = parse_cases$key)

2.4. Consolidate columns in description data

Because the data spans more than a decade, the structure of the description data has changed significantly over time. The base extraction captured the data in a wide format, so I will now consolidate columns where possible, to prepare the description data to be merged with the main fire data table.

Code
##NO OVERLAP - COMBINE CAUSE.GENERAL / CAUSE_GENERAL_KMZ
parsed_desc <- parsed_desc |> 
  mutate(CAUSE_GENERAL_KMZ = case_when(
    !is.na(CAUSE.GENERAL) ~ CAUSE.GENERAL, 
    .default = CAUSE_GENERAL_KMZ)) |> 
  select(-CAUSE.GENERAL)

##NO OVERLAP - COMBINE FIRENUMBER / FIRE_NUMBER
parsed_desc <- parsed_desc |> 
  mutate(FIRENUMBER = case_when(
    !is.na(FIRE_NUMBER) ~ FIRE_NUMBER, 
    .default = FIRENUMBER)) |> 
  select(-FIRE_NUMBER)

##NO OVERLAP - COMBINE FIRENUMBER / Fire.Number
parsed_desc <- parsed_desc |> 
  mutate(FIRENUMBER = case_when(
    !is.na(Fire.Number) ~ Fire.Number, 
    .default = FIRENUMBER)) |> 
  select(-Fire.Number)

##NO OVERLAP - FIREYEAR / FIRE_YEAR
parsed_desc <- parsed_desc |> 
  mutate(FIREYEAR = case_when(
    !is.na(FIRE_YEAR) ~ FIRE_YEAR, 
    .default = FIREYEAR)) |> 
  select(-FIRE_YEAR)

##NO OVERLAP - FIREYEAR / YEAR
parsed_desc <- parsed_desc |> 
  mutate(FIREYEAR = case_when(
    !is.na(YEAR) ~ YEAR, 
    .default = FIREYEAR)) |> 
  select(-YEAR)

##NOTE - FIRE SIZE MEASURED IN HECTARES
##NO OVERLAP - CURRENTSIZE / CURRENT_SIZE
parsed_desc <- parsed_desc |> 
  mutate(CURRENTSIZE = case_when(
    !is.na(CURRENT_SIZE) ~ CURRENT_SIZE, 
    .default = CURRENTSIZE)) |> 
  select(-CURRENT_SIZE)

##NO OVERLAP - CURRENTSIZE / FIRE_SIZE_HA
parsed_desc <- parsed_desc |> 
  mutate(CURRENTSIZE = case_when(
    !is.na(FIRE_SIZE_HA) ~ FIRE_SIZE_HA, 
    .default = CURRENTSIZE)) |> 
  select(-FIRE_SIZE_HA)

##NO OVERLAP - IGNITION_DATE / IGNITION.DATE
parsed_desc <- parsed_desc |> 
  mutate(IGNITION_DATE = case_when(
    !is.na(IGNITION.DATE) ~ IGNITION.DATE, 
    .default = IGNITION_DATE)) |> 
  select(-IGNITION.DATE)

##NO OVERLAP - GEOGRAPHIC_DESCRIPTION / GEOGRAPHIC
parsed_desc <- parsed_desc |> 
  mutate(GEOGRAPHIC_DESCRIPTION = case_when(
    !is.na(GEOGRAPHIC) ~ GEOGRAPHIC, 
    .default = GEOGRAPHIC_DESCRIPTION)) |> 
  select(-GEOGRAPHIC)

##FIRE_OF_NOTE_IND / WAS_FIRE_OF_NOTE_IND
##HAVE OVERLAP
##NOT IDENTICAL
##KEEPING BOTH
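
The pairwise merges above all follow one shape: take the secondary column where it is non-missing, otherwise keep the primary, then drop the secondary. dplyr::coalesce expresses this directly; a sketch with a helper name of my own choosing:

```r
library(dplyr)

## Fold column `from` into column `to` (preferring `from` where present,
## matching the case_when blocks above), then drop `from`.
fold_into <- function(df, to, from) {
  df |> 
    mutate({{to}} := coalesce({{from}}, {{to}})) |> 
    select(-{{from}})
}

## e.g. parsed_desc |> fold_into(FIRENUMBER, FIRE_NUMBER)
d <- tibble(a = c("x", NA), b = c(NA, "y"))
fold_into(d, a, b)
## a is now c("x", "y"); b is dropped
```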

2.5. Join parsed description to fire data table

Now that I’ve consolidated the columns some, I will join the parsed description data back onto the fire data table, and drop the Description column.

Code
##TIDY COLUMN NAMES
parsed_desc <- parsed_desc |> 
  clean_names() |> 
  rename(
    fire_year = fireyear, 
    fire_number = firenumber, 
    fire_type = firetype, 
    current_size = currentsize, 
    cause = cause_general_kmz)

##JOIN PARSED DESCRIPTION DATA ONTO MAIN FIRE DATA
bc_fire_data <- bc_fire_data |> 
  left_join(parsed_desc, by = "key") |> 
  select(-description)

##UPDATE MISSING FIRE_YEAR VALUES FROM DATA_YEAR, THEN DROP DATA_YEAR
bc_fire_data <- bc_fire_data |> 
  mutate(fire_year = case_when(
    is.na(fire_year) ~ as.character(data_year), 
    .default = fire_year
  )) |> 
  select(-data_year)

##CLEAN UP PARSING FILES
rm(parse_cases)
rm(parsed_desc)

2.6. Continue cleaning the fire data

Preparing the XML Description column for analysis was an important part of data cleaning. Now, that process can continue with a review of the contents of the bc_fire_data table. I will be looking for inconsistent data, resolvable null values, and any issues that might impact the quality of the analysis. First I will do some general standardization and cleanup of the data.

Code
##STANDARDIZE TEXT COLUMNS

bc_fire_data <- bc_fire_data |> 
  mutate(name = str_to_upper(str_trim(name))) |> 
  mutate(fire_number = str_to_upper(str_trim(fire_number))) |> 
  mutate(fire_type = str_to_upper(str_trim(fire_type))) |> 
  mutate(cause = str_to_upper(str_trim(cause))) |> 
  mutate(geographic_description = str_to_upper(str_trim(geographic_description))) |> 
  mutate(incident_name = str_to_upper(str_trim(incident_name))) |> 
  mutate(fire_of_note_ind = str_to_upper(str_trim(fire_of_note_ind))) |> 
  mutate(response_type_desc = str_to_upper(str_trim(response_type_desc))) |> 
  mutate(was_fire_of_note_ind = str_to_upper(str_trim(was_fire_of_note_ind))) |> 
  mutate(latitude = as.numeric(latitude)) |> 
  mutate(longitude = as.numeric(longitude))

2.6.1. Recode “<NULL>” / “<Null>” / “<NA>”

In some observations, null values have been recorded with a text value of “<NULL>”, “<Null>”, or “<NA>”. It is more appropriate for these to be recoded as NA, so that will be my next step.

Code
bc_fire_data <- bc_fire_data |> 
  mutate(across(c(current_size, cause, ignition_date, geographic_description, 
                  incident_name, response_type_desc), 
                \(x) if_else(x %in% c("<NULL>", "<Null>", "<NA>"), NA, x))) |> 
  mutate(current_size = as.numeric(current_size))

2.6.2. Collapsing name, fire_number, and geographic_description

Name, fire_number, and geographic_description appear to have some overlap in their contents. I will review them to see if the columns can be consolidated without losing data.

Code
##REVIEW NAME, FIRE_NUMBER, GEOGRAPHIC_DESCRIPTION
##FOR CASES WHERE NAME == GEOGRAPHIC_DESCRIPTION AND FIRE_NUMBER EXISTS, UPDATE NAME TO FIRE_NUMBER

bc_fire_data <- bc_fire_data |> 
  mutate(name = case_when(
    name == geographic_description ~ fire_number,
    .default = name
  )) |> 
  select(-fire_number) ##NAME WILL NOW FULLY CAPTURE FIRE_NUMBER, REMOVE FIRE_NUMBER

2.6.3. Drop fire_type

A quick check reveals that fire_type only takes the values FIRE or NA. This is self-evident and carries no information for further analysis, so I will drop it.

Code
bc_fire_data <- bc_fire_data |> 
  select(-fire_type)

2.6.4. Recode cause HUMAN / PERSON

The data in cause falls into three categories besides NA: HUMAN, LIGHTNING, and PERSON. HUMAN and PERSON represent the same cause, so I will recode PERSON to HUMAN.

Code
bc_fire_data <- bc_fire_data |> 
  mutate(cause = case_when(
    cause == "PERSON" ~ "HUMAN", 
    .default = cause
  ))

2.6.5. Break ignition_date into separate columns

Ignition date contains significant data, but it does not have a consistent format. To resolve that, I will break each of its components out into its own column and then drop ignition_date.


I considered dropping ignition_yr and using fire_year in its place, but there are a number of observations where fire_year differs from ignition_yr, so I will keep both for now.

Code
bc_fire_data <- bc_fire_data |> 
  mutate(ignition_yr = str_extract(ignition_date, "^[0-9]{4}")) |> 
  mutate(ignition_mo = str_remove_all(str_extract(ignition_date, "[-//][0-9]{2}[-//]"), "[-//]")) |> 
  mutate(ignition_day = as.numeric(str_remove(str_extract(ignition_date, "[-//][0-9]{2}[-//][0-9]{2}"), "[-//][0-9]{2}[-//]"))) |> 
  mutate(ignition_hr = as.numeric(str_trim(str_remove(str_extract(ignition_date, "\\s[0-9]{1,2}:"), ":")))) |> 
  mutate(ignition_min = as.numeric(str_remove_all(str_extract(ignition_date, ":[0-9]{2}:"), ":"))) |> 
  mutate(ignition_sec = as.numeric(str_trim(str_remove(str_extract(ignition_date, ":[0-9]{2}\\s"), ":")))) |> 
  mutate(ignition_am_pm = str_trim(str_extract(ignition_date, "\\s[A-Z]{2}$"))) |> 
  select(-ignition_date)
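
To illustrate the extraction patterns on a hypothetical ignition string (real values vary in format, which is why each component is pulled independently):

```r
library(stringr)

s <- "2017/08/02 10:30:45 PM"  ## HYPOTHETICAL sample value

str_extract(s, "^[0-9]{4}")                                              # "2017"
str_remove_all(str_extract(s, "[-//][0-9]{2}[-//]"), "[-//]")            # "08"
as.numeric(str_trim(str_remove(str_extract(s, "\\s[0-9]{1,2}:"), ":")))  # 10
str_trim(str_extract(s, "\\s[A-Z]{2}$"))                                 # "PM"
```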

2.6.6. Flag duplicates and errors in geographic_description and move response type info to response_type_desc

A number of items in geographic_description are flagged as duplicates or errors. The context of these isn’t totally clear right now, so rather than drop them entirely, for now I’m going to create a drop_flag column so they are easily excludable later.

Once that is done, I will move the information on response types contained in the geographic_description column into the response_type_desc column. Note that anything with a combination flag will be categorized in the more severe category.

Code
##FLAG ERROR AND DUPLICATE LINES
bc_fire_data <- bc_fire_data |> 
  mutate(drop_flag = case_when(
    str_detect(geographic_description, "DUPLICATE") ~ TRUE, 
    str_detect(geographic_description, "ERROR") ~ TRUE, 
    str_detect(geographic_description, "AMALGAMATED") ~ TRUE, 
    .default = FALSE
  ))

##TRANSFER RESPONSE INFO TO RESPONSE_TYPE_DESC
##NOTE - MOST SPECIFIC PATTERNS LISTED FIRST, SO CASE_WHEN MATCHES THEM BEFORE THE GENERAL ONES
##NOTE - GENERIC MODIFIED CHECKED BEFORE MONITOR SO COMBINATION FLAGS GET THE MORE SEVERE CATEGORY
bc_fire_data <- bc_fire_data |> 
  mutate(response_type_desc = case_when(
    str_detect(geographic_description, "\\*MONITOR/MODIFIED") ~ "MODIFIED", 
    str_detect(geographic_description, "\\*MODIF") ~ "MODIFIED", 
    str_detect(geographic_description, "\\*MONIT") ~ "MONITOR", 
    str_detect(geographic_description, "\\-\\s*MODIFIED") ~ "MODIFIED", 
    str_detect(geographic_description, "\\-\\s*MONITOR") ~ "MONITOR", 
    str_detect(geographic_description, "MODIFIED") ~ "MODIFIED", 
    str_detect(geographic_description, "MONITOR") ~ "MONITOR", 
    .default = response_type_desc)) |> 
  mutate(geographic_description = case_when(
    str_detect(geographic_description, "\\*MONITOR/MODIFIED") ~ str_trim(str_replace(geographic_description, "\\*+MONITOR/MODIFIED.*?\\*+", "")), 
    str_detect(geographic_description, "\\*MODIF") ~ str_trim(str_replace(geographic_description, "\\*+MODIF.*?\\*+", "")), 
    str_detect(geographic_description, "\\*MONIT") ~ str_trim(str_replace(geographic_description, "\\*+MONIT.*?\\*+", "")), 
    str_detect(geographic_description, "\\*MR") ~ str_trim(str_replace(geographic_description, "\\*+MR.*?\\*+", "")), 
    str_detect(geographic_description, "\\-\\s*MODIFIED") ~ str_trim(str_replace(geographic_description, "\\-\\s*MODIFIED.*?$", "")), 
    str_detect(geographic_description, "\\-\\s*MONITOR") ~ str_trim(str_replace(geographic_description, "\\-\\s*MONITOR.*?$", "")), 
    str_detect(geographic_description, "^MONITOR") ~ str_trim(str_replace(geographic_description, "^MONITOR\\s*\\-*\\s*", "")), 
    str_detect(geographic_description, "/*\\s*MODIFIED") ~ str_trim(str_replace(geographic_description, "/*\\s*MODIFIED.*?$", "")), 
    str_detect(geographic_description, "/*\\s*MONITOR") ~ str_trim(str_replace(geographic_description, "/*\\s*MONITOR.*?$", "")), 
    .default = geographic_description
  ))

2.6.7. Convert fire_of_note_ind and was_fire_of_note_ind to bool

Both fire_of_note_ind and was_fire_of_note_ind are Y/N columns, so it will be more useful to convert them to logical (boolean) values.

Code
bc_fire_data <- bc_fire_data |> 
  mutate(fire_of_note_ind = case_when(
    fire_of_note_ind == "Y" ~ TRUE, 
    fire_of_note_ind == "N" ~ FALSE, 
    .default = NA)) |> 
  mutate(was_fire_of_note_ind = case_when(
    was_fire_of_note_ind == "Y" ~ TRUE, 
    was_fire_of_note_ind == "N" ~ FALSE,
    .default = NA))
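
A compact equivalent for the Y/N conversion, assuming only "Y", "N", and NA occur (case_match requires dplyr >= 1.1; the helper name is my own):

```r
library(dplyr)

## Map "Y"/"N" to TRUE/FALSE; anything else becomes NA
yn_to_logical <- function(x) case_match(x, "Y" ~ TRUE, "N" ~ FALSE, .default = NA)

yn_to_logical(c("Y", "N", NA))
## TRUE FALSE NA
```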

2.6.8. Convert eligible columns to factors

Since I have gone through all the columns and reorganized the data, I can now convert the categorical columns in the data to factors.

Code
##CONVERT NUMERIC MONTH TO ABBREVIATION
month_cat <- c("JAN", "FEB", "MAR", "APR", "MAY","JUN", 
               "JUL", "AUG", "SEP", "OCT", "NOV", "DEC")

bc_fire_data <- bc_fire_data |> 
  mutate(fire_year = factor(fire_year)) |>
  mutate(cause = factor(cause)) |> 
  mutate(ignition_yr = factor(ignition_yr)) |> 
  mutate(ignition_mo = factor(month_cat[as.numeric(ignition_mo)])) |> 
  mutate(ignition_am_pm = factor(ignition_am_pm)) |> 
  mutate(response_type_desc = factor(response_type_desc))
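
The month conversion works by plain vector indexing; for example:

```r
month_cat <- c("JAN", "FEB", "MAR", "APR", "MAY", "JUN", 
               "JUL", "AUG", "SEP", "OCT", "NOV", "DEC")

## A two-digit string month indexes into the abbreviation vector
month_cat[as.numeric("07")]
## "JUL"
```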

2.6.9. Re-de-duplicate the data

Now that the data across the columns has been standardized, moved around, and cleaned up, it is likely that there are new duplicate cases in the data that can be removed. I will do that now.

Code
##DROP FULL DUPLICATES
bc_fire_data <- bc_fire_data |>
  distinct(across(-key), .keep_all = TRUE)

##DETERMINE APPROPRIATE RANGE
##NOTE - COS() TAKES RADIANS, SO THE MEAN LATITUDE IS CONVERTED WITH * PI/180
##https://stackoverflow.com/questions/639695/how-to-convert-latitude-or-longitude-to-meters
avg_meters_per_degree_long <- ((pi/180) * 6378 * cos(mean(bc_fire_data$latitude, na.rm = TRUE) * pi/180)) * 1000
avg_meters_per_degree_lat <- ((pi/180) * 6378) * 1000

##TESTING METRICS FOR DISTANCE:
##2M - 30116
##5M - 30023
##10M - 30023
##25M - 30019
##50M - 30018
##75M - 30017
##100M - 30016
##1000M - 29982

##N = NUMBER OF METERS FOR EQUIVALENCE CHECK
n_long = 25
long_dist_deg = n_long / avg_meters_per_degree_long

##IDENTIFY ENTRIES W/IN N METERS OF EACH OTHER
dupe_indexes <- bc_fire_data |> 
  st_equals_exact(remove_self = TRUE, par = long_dist_deg)

##CONVERT DUPLICATE INDEXES INTO SINGLE COLUMN DF
dupe_indexes <- unlist(dupe_indexes)
dupe_indexes <- data.frame(dupe_indexes) |> 
  distinct()

##DETERMINE CASES TO KEEP
bc_fire_data_dupes <- bc_fire_data[dupe_indexes[[1]],]

##IDENTIFY DUPLICATES (NOT CONSIDERING GEOMETRY)
bc_fire_data_keep <- bc_fire_data_dupes |>
  st_drop_geometry() |> 
  distinct(across(-key), .keep_all = TRUE)

##ALL DUPE CASES - CASES TO KEEP = CASES TO DROP
dupe_keys_to_drop <- setdiff(bc_fire_data_dupes$key, bc_fire_data_keep$key)

##DROP IDENTIFIED CASES FROM MAIN TABLE
bc_fire_data <- bc_fire_data |> 
  filter(!key %in% dupe_keys_to_drop)

rm(bc_fire_data_dupes, bc_fire_data_keep, dupe_indexes, dupe_keys_to_drop)
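
As a sanity check on the degree-to-meter conversion above, using an illustrative latitude of 54° N (roughly B.C.'s midpoint) rather than the data's actual mean:

```r
## (pi/180) converts degrees to radians; 6378 km is Earth's equatorial radius
m_per_deg_long <- (pi / 180) * 6378 * cos(54 * pi / 180) * 1000
round(m_per_deg_long)  ## about 65,000 m, i.e. one degree of longitude is ~65 km here
```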

3. Data Corroboration

To confirm the validity of the data, I will download a secondary source of the same data and review it against the primary data set.

Code
corrob_data <- read_html("https://www2.gov.bc.ca/gov/content/safety/wildfire-status/about-bcws/wildfire-statistics") |> 
  html_element("table") |> 
  html_table()

3.1. Minimal cleaning of corroborating data

I don’t want to clean the corroborating data much, since it is for validating the main fire data set, but I do want to remove the year from the Fire Number so its format matches the primary data. I will also tidy the column names a bit using the janitor package.

Code
corrob_data <- corrob_data |> 
  clean_names() |> 
  mutate(fire_number = str_trim(str_remove(fire_number, "\\([0-9]{4}\\)")))
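
For example, with a hypothetical fire number in the web table's format:

```r
library(stringr)

fn <- "G41358 (2017)"  ## HYPOTHETICAL value; format assumed from the regex above
str_trim(str_remove(fn, "\\([0-9]{4}\\)"))
## "G41358"
```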

3.2. Joining data sets

Now that I have my corroborating data in place, I will join the matching records from the primary data set based on fire_number / name, and start comparing the data from the two data sets.

Code
corrob_data <- corrob_data |> 
  mutate(year = as.factor(year))

corrob_join <- corrob_data |> 
  left_join(bc_fire_data, by = join_by(fire_number == name, year == fire_year))

3.3. Resolving duplicates in the primary data highlighted by the corroborating data

After the join, there are some additional cases in the joined table. This indicates that some duplicates were found based on the join criteria, name and year. Review indicates that many of these are equivalent rows that just need to be consolidated. I will extract the rows, re-join them to the table with all matching rows, and backfill any empty fields in the main data table from that data. Then I will drop the duplicated columns, and drop duplicates from the table.

Code
##CREATE TABLE OF TYPE A (GEOG DESC NOT NA) DUPE LINES FOR RE-MERGING AND COPYING
line_merge_a <- bc_fire_data |> 
  mutate(fynum = paste0(name, fire_year)) |>
  add_count(fynum, name = "n") |> 
  filter(n > 1) |> 
  filter(!is.na(geographic_description)) |> 
  select(name, 
         fire_year, 
         cause, 
         geographic_description, 
         incident_name, 
         fire_of_note_ind, 
         response_type_desc, 
         was_fire_of_note_ind, 
         ignition_yr, 
         ignition_mo, 
         ignition_day, 
         ignition_hr, 
         ignition_min, 
         ignition_sec, 
         ignition_am_pm) |> 
  rename(dupe_a_cause = cause, 
         dupe_a_geographic_description = geographic_description, 
         dupe_a_incident_name = incident_name, 
         dupe_a_fire_of_note_ind = fire_of_note_ind, 
         dupe_a_response_type_desc = response_type_desc, 
         dupe_a_was_fire_of_note_ind = was_fire_of_note_ind, 
         dupe_a_ignition_yr = ignition_yr, 
         dupe_a_ignition_mo = ignition_mo, 
         dupe_a_ignition_day = ignition_day, 
         dupe_a_ignition_hr = ignition_hr, 
         dupe_a_ignition_min = ignition_min, 
         dupe_a_ignition_sec = ignition_sec, 
         dupe_a_ignition_am_pm = ignition_am_pm) |> 
  st_drop_geometry() |> 
  distinct()

##CREATE TABLE OF TYPE B (CAUSE NOT NA) DUPE LINES FOR RE-MERGING AND COPYING
line_merge_b <- bc_fire_data |> 
  mutate(fynum = paste0(name, fire_year)) |>
  add_count(fynum, name = "n") |> 
  filter(n > 1) |> 
  filter(!is.na(cause)) |> 
  select(name, 
         fire_year, 
         cause, 
         geographic_description, 
         incident_name, 
         fire_of_note_ind, 
         response_type_desc, 
         was_fire_of_note_ind, 
         ignition_yr, 
         ignition_mo, 
         ignition_day, 
         ignition_hr, 
         ignition_min, 
         ignition_sec, 
         ignition_am_pm) |> 
  rename(dupe_b_cause = cause, 
         dupe_b_geographic_description = geographic_description, 
         dupe_b_incident_name = incident_name, 
         dupe_b_fire_of_note_ind = fire_of_note_ind, 
         dupe_b_response_type_desc = response_type_desc, 
         dupe_b_was_fire_of_note_ind = was_fire_of_note_ind, 
         dupe_b_ignition_yr = ignition_yr, 
         dupe_b_ignition_mo = ignition_mo, 
         dupe_b_ignition_day = ignition_day, 
         dupe_b_ignition_hr = ignition_hr, 
         dupe_b_ignition_min = ignition_min, 
         dupe_b_ignition_sec = ignition_sec, 
         dupe_b_ignition_am_pm = ignition_am_pm) |> 
  st_drop_geometry() |> 
  distinct()

#MERGE COLUMNS FROM DUPE TABLES BACK ONTO MAIN
bc_fire_data <- bc_fire_data |> 
  left_join(line_merge_a, by = join_by(name, fire_year)) |> 
  left_join(line_merge_b, by = join_by(name, fire_year))

##COPY DATA FROM DUPE COLUMNS INTO EMPTY CELLS IN MAIN TABLE.  COPYING BOTH WAYS WILL ALLOW ME TO DROP DUPES.
bc_fire_data <- bc_fire_data |> 
  mutate(cause = case_when(
    (is.na(cause)) & (!is.na(dupe_a_cause)) ~ dupe_a_cause, 
    (is.na(cause)) & (!is.na(dupe_b_cause)) ~ dupe_b_cause, 
    .default = cause)) |> 
  mutate(geographic_description = case_when(
    (is.na(geographic_description)) & (!is.na(dupe_a_geographic_description)) ~ dupe_a_geographic_description, 
    (is.na(geographic_description)) & (!is.na(dupe_b_geographic_description)) ~ dupe_b_geographic_description, 
    .default = geographic_description)) |> 
  mutate(incident_name = case_when(
    (is.na(incident_name)) & (!is.na(dupe_a_incident_name)) ~ dupe_a_incident_name, 
    (is.na(incident_name)) & (!is.na(dupe_b_incident_name)) ~ dupe_b_incident_name, 
    .default = incident_name)) |> 
  mutate(fire_of_note_ind = case_when(
    (is.na(fire_of_note_ind)) & (!is.na(dupe_a_fire_of_note_ind)) ~ dupe_a_fire_of_note_ind, 
    (is.na(fire_of_note_ind)) & (!is.na(dupe_b_fire_of_note_ind)) ~ dupe_b_fire_of_note_ind, 
    .default = fire_of_note_ind)) |> 
  mutate(response_type_desc = case_when(
    (is.na(response_type_desc)) & (!is.na(dupe_a_response_type_desc)) ~ dupe_a_response_type_desc, 
    (is.na(response_type_desc)) & (!is.na(dupe_b_response_type_desc)) ~ dupe_b_response_type_desc, 
    .default = response_type_desc)) |> 
  mutate(was_fire_of_note_ind = case_when(
    (is.na(was_fire_of_note_ind)) & (!is.na(dupe_a_was_fire_of_note_ind)) ~ dupe_a_was_fire_of_note_ind, 
    (is.na(was_fire_of_note_ind)) & (!is.na(dupe_b_was_fire_of_note_ind)) ~ dupe_b_was_fire_of_note_ind, 
    .default = was_fire_of_note_ind)) |> 
  mutate(ignition_yr = case_when(
    (is.na(ignition_yr)) & (!is.na(dupe_a_ignition_yr)) ~ dupe_a_ignition_yr, 
    (is.na(ignition_yr)) & (!is.na(dupe_b_ignition_yr)) ~ dupe_b_ignition_yr, 
    .default = ignition_yr)) |> 
  mutate(ignition_mo = case_when(
    (is.na(ignition_mo)) & (!is.na(dupe_a_ignition_mo)) ~ dupe_a_ignition_mo, 
    (is.na(ignition_mo)) & (!is.na(dupe_b_ignition_mo)) ~ dupe_b_ignition_mo, 
    .default = ignition_mo)) |> 
  mutate(ignition_day = case_when(
    (is.na(ignition_day)) & (!is.na(dupe_a_ignition_day)) ~ dupe_a_ignition_day, 
    (is.na(ignition_day)) & (!is.na(dupe_b_ignition_day)) ~ dupe_b_ignition_day, 
    .default = ignition_day)) |> 
  mutate(ignition_hr = case_when(
    (is.na(ignition_hr)) & (!is.na(dupe_a_ignition_hr)) ~ dupe_a_ignition_hr, 
    (is.na(ignition_hr)) & (!is.na(dupe_b_ignition_hr)) ~ dupe_b_ignition_hr, 
    .default = ignition_hr)) |> 
  mutate(ignition_min = case_when(
    (is.na(ignition_min)) & (!is.na(dupe_a_ignition_min)) ~ dupe_a_ignition_min, 
    (is.na(ignition_min)) & (!is.na(dupe_b_ignition_min)) ~ dupe_b_ignition_min, 
    .default = ignition_min)) |> 
  mutate(ignition_sec = case_when(
    (is.na(ignition_sec)) & (!is.na(dupe_a_ignition_sec)) ~ dupe_a_ignition_sec, 
    (is.na(ignition_sec)) & (!is.na(dupe_b_ignition_sec)) ~ dupe_b_ignition_sec, 
    .default = ignition_sec)) |> 
  mutate(ignition_am_pm = case_when(
    (is.na(ignition_am_pm)) & (!is.na(dupe_a_ignition_am_pm)) ~ dupe_a_ignition_am_pm, 
    (is.na(ignition_am_pm)) & (!is.na(dupe_b_ignition_am_pm)) ~ dupe_b_ignition_am_pm, 
    .default = ignition_am_pm))

##REMOVE DUPE COLUMNS
bc_fire_data <- bc_fire_data |> 
  select(-contains("dupe_"))

##CREATE LIST OF ROWS TO KEEP BY ROUNDING LAT / LONG AND KEEPING DISTINCT
bc_fire_data_no_geo <- bc_fire_data |> 
  st_drop_geometry() |> 
  mutate(latitude = round(latitude, digits = 2)) |> 
  mutate(longitude = round(longitude, digits = 2)) |> 
  distinct(across(-key), .keep_all = TRUE) |> 
  select(key)

bc_fire_data <- bc_fire_data |>
  filter(key %in% bc_fire_data_no_geo$key)

rm(bc_fire_data_no_geo, line_merge_a, line_merge_b)
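The rounding step above does the heavy lifting: coordinates that agree to two decimal places (on the order of a kilometre) are treated as the same location. A minimal base-R sketch of the idea, using hypothetical coordinates:

```r
## TOY EXAMPLE: ROWS 1 AND 2 ARE TWO REPORTS OF THE SAME FIRE
fires <- data.frame(
  key       = c(1, 2, 3),
  latitude  = c(49.88132, 49.88139, 53.91702),
  longitude = c(-119.49361, -119.49365, -122.74965)
)

## ROUND TO 2 DECIMAL PLACES, THEN KEEP THE FIRST ROW OF EACH ROUNDED LOCATION
rounded <- data.frame(
  latitude  = round(fires$latitude, 2),
  longitude = round(fires$longitude, 2)
)
fires_deduped <- fires[!duplicated(rounded), ]

nrow(fires_deduped)  ## 2 - THE NEAR-IDENTICAL PAIR COLLAPSES TO ONE ROW
```

As in the main pipeline, the surviving key values are then used to filter the full sf object, so the original unrounded geometry is retained.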

3.4. Returning to joining the data sets

With the data further cleaned and de-duplicated, I will now re-join the main and corroboratory data sets and examine the results.

The join still creates a few additional records. I spot checked the remaining cases that share a name and fire_year combination, and it was not obvious whether each pair represented the same fire. With that in mind, I will manually compare the duplicated cases and use the corroborating data to decide whether one member of each pair should be dropped.
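The row inflation happens because a left join emits one output row per matching pair. A small dplyr sketch with hypothetical fire numbers shows the effect:

```r
library(dplyr)

corrob <- tibble(fire_number = c("K51234", "K59876"),
                 year        = c(2013, 2013))

fires  <- tibble(name         = c("K51234", "K51234", "K59876"),
                 fire_year    = c(2013, 2013, 2013),
                 current_size = c(12.0, 12.5, 3.1))

## K51234 MATCHES TWO ROWS, SO THE JOIN RETURNS 3 ROWS FOR 2 INPUT ROWS
joined <- corrob |>
  left_join(fires, by = join_by(fire_number == name, year == fire_year))

nrow(joined)  ## 3
```

In dplyr 1.1+, passing relationship = "one-to-one" to left_join() would raise an error here instead, which can be a useful way to surface such duplicates automatically.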

Code
corrob_join <- corrob_data |> 
  left_join(bc_fire_data, by = join_by(fire_number == name, year == fire_year))

##MANUAL DROPS
bc_fire_data <- bc_fire_data |> 
  filter(
    key != 9063, ##NO VISIBLE DIFFERENCE, NOT SURE WHY BOTH EXIST, DROPPING EARLIER
    key != 9317, ##SLIGHT DIFFERENCE IN LATITUDE, DROPPING EARLIER
    key != 9145, ##SLIGHT DIFFERENCE IN LATITUDE AND LONGITUDE, DROPPING EARLIER
    key != 9129, ##SLIGHT DIFFERENCE IN LATITUDE, BIGGER DIFFERENCE IN SIZE - DROPPING EARLIER / SMALLER
    key != 8384, ##MODERATE DIFFERENCE IN LATITUDE, DROPPING EARLIER
    key != 8144, ##SLIGHT DIFFERENCE IN LATITUDE AND LONGITUDE, DROPPING EARLIER
    key != 3796, ##MODERATE DIFFERENCE IN SIZE, DROPPING EARLIER / SMALLER
    key != 4275, ##MODERATE DIFFERENCE IN LATITUDE AND LONGITUDE, KEEPING LATER
    key != 4405, ##MODERATE DIFFERENCE IN LATITUDE AND LONGITUDE, KEEPING LATER
    key != 5047 ##MODERATE DIFFERENCE IN SIZE, DROPPING EARLIER / SMALLER
  )
##REJOIN AFTER DROPS
corrob_join <- corrob_data |> 
  left_join(bc_fire_data, by = join_by(fire_number == name, year == fire_year))

##DROP ROWS THAT ARE IN CORRAB DATA BUT NOT IN MAIN
corrob_join <- corrob_join |> 
  semi_join(bc_fire_data, by = join_by(fire_number == name, year == fire_year))

3.5. Reviewing the corroborated and main data sets

With the duplicates cleaned up, and extra cases no longer being created in the joined table, I can compare the corroborating data with the primary data. I plan to compare the following aspects:

  • Size
  • Geographic description
  • Date of discovery
  • Location (latitude, longitude)

If these elements align well for this sample, it will give me confidence in my main data set.

3.5.1. Comparing the sizes

First I will compare the sizes of the fires in the two data sets. I will check for differences and missing data.


Results:

  • Mismatched values: 1 / 916

Action Items:

  • None

With only a single case differing between the main and corroborative data, I can feel confident moving forward using current_size.

Code
##NUMBER OF SIZES THAT DON'T MATCH
corrob_join |> 
  filter(size_ha != current_size)

##NUMBER OF VALUES IN CORROBORATION DATA THAT ARE NOT IN MAIN DATA
corrob_join |>
  filter(!is.na(size_ha) & is.na(current_size))

##NUMBER OF VALUES IN MAIN DATA THAT ARE NOT IN CORROBORATION DATA
corrob_join |>
  filter(is.na(size_ha) & !is.na(current_size))

3.5.2. Comparing geographic descriptions

I will follow a similar process regarding geographic description, comparing the values and looking for differences. I will investigate any differences further to inform the validity of this data.


Results:

  • Mismatched values: 27 / 916
  • Mismatched with substantive difference: 1 / 916
  • NA geographic_description where geographic has a value: 700 / 916

Action Items:

  • Add additional geographic data to bc_fire_data

The mismatched values are largely due to the earlier transfer of response data from geographic_description to response_type_desc. There is one substantive difference related to the location of the fire relative to a river, but without further distinguishing information that will have to suffice.

In terms of corroboration, only a single substantive difference was found, which is acceptable; I can be confident moving forward with geographic_description in my analysis.

Separately, 700 additional geographic entries were detected, for cases where geographic_description does not have a value. Since the shared cases were so consistent when compared with each other, I’m also confident in transferring those values to geographic_description for potential use in further analysis.

Code
##NUMBER OF GEOGRAPHIC DESCRIPTIONS THAT DON'T MATCH
corrob_join |> 
  select(geographic, geographic_description) |> 
  mutate(geographic = toupper(geographic)) |> 
  filter(geographic != geographic_description)

##NUMBER OF VALUES IN CORROBORATION DATA THAT ARE NOT IN MAIN DATA
corrob_join |>
  filter(!is.na(geographic) & is.na(geographic_description))

##NUMBER OF VALUES IN MAIN DATA THAT ARE NOT IN CORROBORATION DATA
corrob_join |>
  filter(is.na(geographic) & !is.na(geographic_description))

##TRANSFER OF GEOGRAPHIC VALUES TO GEOGRAPHIC_DESCRIPTION
geo_transfer <- corrob_join |> 
  select(key, geographic, geographic_description) |> 
  mutate(geographic = toupper(geographic)) |> 
  filter(!is.na(geographic) & is.na(geographic_description)) |> 
  select(-geographic_description)

bc_fire_data <- bc_fire_data |> 
  left_join(geo_transfer, by = join_by(key)) |> 
  mutate(geographic_description = case_when(
    ((!is.na(geographic)) & (is.na(geographic_description))) ~ geographic, 
    .default = geographic_description
  )) |> 
  select(-geographic)

rm(geo_transfer)

3.5.3. Comparing discovery / ignition dates

I’d also like to examine the relationship between discovery and ignition dates, although there is no guarantee that they represent the same data. If they are clearly quite different I will leave them alone and move on, but if they are quite similar, I will examine the differences.


Results:

  • Mismatched values: 104 / 916
  • Mismatched with substantive difference: 1 / 916
  • NA ignition date where discovery date has a value: 700 / 916

Action Items:

  • None

Because of my initial reservations about comparing the two dates, I’ve decided to take no action. If I had validated their equivalence, I could have moved the 700 additional values from the corroborating data to the main fire data. The two values clearly overlap and represent related concepts; however, the weaker alignment relative to the previous comparisons, and the conceptual difference in the variable naming, have led me to leave them as they are.

In effect I have established that these are likely conceptually different values and should be kept separate.
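The comparison in the next chunk hinges on parsing the free-text discovery_date into a proper Date. A sketch of that regex extraction, using a hypothetical "Mon DD YYYY" value (the real column follows a similar layout):

```r
library(stringr)
library(lubridate)

discovery_date <- "Jul 15 2019"  ## HYPOTHETICAL EXAMPLE VALUE

mo  <- toupper(str_extract(discovery_date, "\\w{3}"))       ## FIRST 3 WORD CHARS -> "JUL"
day <- str_trim(str_extract(discovery_date, "\\s\\d{1,2}")) ## SPACE + 1-2 DIGITS -> "15"
yr  <- str_extract(discovery_date, "\\d{4}$")               ## 4 DIGITS AT END -> "2019"

parsed <- ymd(paste0(yr, mo, day))  ## ymd() ACCEPTS MONTH ABBREVIATIONS
parsed  ## 2019-07-15
```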

Code
date_review <- corrob_join |> 
  mutate(discovery_mo = toupper(str_extract(discovery_date, "\\w{3}"))) |> 
  mutate(discovery_day = str_trim(str_extract(discovery_date, "\\s\\d{1,2}"))) |> 
  mutate(discovery_yr = str_extract(discovery_date, "\\d{4}$")) |> 
  mutate(discovery_combo = ymd(paste0(discovery_yr, discovery_mo, discovery_day))) |> 
  mutate(ignition_combo = ymd(paste0(ignition_yr, ignition_mo, ignition_day))) |> 
  mutate(date_diff = abs(as.numeric(discovery_combo - ignition_combo)))

##NUMBER OF FIRE DATES THAT DON'T MATCH
date_review |> 
  select(discovery_combo, ignition_combo) |> 
  filter(discovery_combo != ignition_combo)

##NUMBER OF VALUES IN CORROBORATION DATA THAT ARE NOT IN MAIN DATA
date_review |>
  filter(!is.na(discovery_combo) & is.na(ignition_combo))

##NUMBER OF VALUES IN MAIN DATA THAT ARE NOT IN CORROBORATION DATA
date_review |>
  filter(is.na(discovery_combo) & !is.na(ignition_combo))

##SUMMARY STATISTICS OF DATE DIFFERENCE
date_stats <- date_review |> 
  summarise(abs_mean = mean(date_diff, na.rm = TRUE), 
            abs_median = median(date_diff, na.rm = TRUE), 
            sd = sd(date_diff, na.rm = TRUE))

zero_count <- date_review |> 
  filter(date_diff == 0) |> 
  count()
Code
##PLOT DIFFERENCE IN DATES WITH MEAN AND SD
date_review |> 
  select(date_diff) |> 
  filter(!is.na(date_diff)) |> 
  ggplot(aes(x = date_diff)) + 
  geom_histogram(color = "mediumpurple3", fill = "gray67") + 
  coord_cartesian(ylim = c(0,30)) + 
  annotate("segment", x = 20, y = 27, xend = 2, yend = 30,
                     arrow = arrow(length = unit(0.3, "cm")), color = "gray23") + 
  annotate("text", x = 31, y = 26, size = 3, label = paste0("Truncated for visibility,\nTotal 0 difference = ", zero_count$n), color = "gray23") + 
  annotate("text", x = date_stats$abs_mean + 3, y = 31, size = 2.5, label = "Mean", color = "dodgerblue3") + 
  annotate("text", x = date_stats$abs_mean + date_stats$sd + 4, y = 31, size = 2.5, label = "1 Std dev", color = "tomato1") + 
  geom_vline(xintercept = date_stats$abs_mean, linetype = 2, color = "dodgerblue3") + 
  geom_vline(xintercept = date_stats$abs_mean + date_stats$sd, linetype = 2, color = "tomato1") + 
  theme_minimal() + 
  labs(
  x = "Absolute difference (days)", 
  y = "Frequency",
  title = "Corroborating ignition date using discovery date",
  subtitle = "While the two seem to be related, they also have significant differences that limit use in corroboration",
  caption = "\nData from Government of British Columbia fire tracking, 2012 - 2024.\nhttps://www2.gov.bc.ca/gov/content/safety/wildfire-status/about-bcws/wildfire-statistics\nhttps://www.for.gov.bc.ca/ftp/HPR/external/!publish/Maps_and_Data/GoogleEarth/WMB_Fires/")

Ignition date and discovery date are mostly equivalent, but the number and distribution of differences suggest that they are related but distinct concepts. As such, data for both will be retained.
Code
rm(date_stats, zero_count, date_review)

3.5.4. Comparing locations using latitude and longitude

The validation of the fire locations is critical for corroborating the overall quality of the data. I will split the corrob_join table into two tables, and make each a simple features (sf) object using the two sets of coordinates.

The latitude / longitude values in each table follow different formats. In the corroboration table they are stored in degrees decimal minutes, while in the fire data table they are stored in decimal degrees. I will convert the corroborating data to decimal degrees for compatibility.

With the location data in the same format, it also becomes clear that the sign of the corroborating longitude is flipped: it is positive, but since B.C. lies west of the prime meridian it should be negative. I will correct the sign before continuing the analysis.

Once the coordinate systems are matched, I will do a geospatial analysis for equivalence, similar to the one I did in section 2.6.9. above.
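The conversion is decimal degrees = whole degrees + (decimal minutes / 60). A base-R sketch with a hypothetical coordinate pair, including the longitude sign flip:

```r
## DEGREES DECIMAL MINUTES STORED AS "DD MM.M" STRINGS (HYPOTHETICAL VALUES)
lat_ddm  <- "49 30.6"
long_ddm <- "123 15.0"  ## STORED POSITIVE, BUT B.C. LONGITUDES SHOULD BE NEGATIVE

ddm_to_dd <- function(x) {
  parts <- strsplit(x, "\\s+")[[1]]
  as.numeric(parts[1]) + as.numeric(parts[2]) / 60
}

latitude  <- ddm_to_dd(lat_ddm)    ## 49 + 30.6/60 = 49.51
longitude <- -ddm_to_dd(long_ddm)  ## -(123 + 15/60) = -123.25
```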

Code
##CONVERT CORROB LAT / LONG IN DEGREES DECIMAL MINUTES TO DECIMAL DEGREES
##FORMULA = DEGREES + (MINUTES / 60)
corrob_join <- corrob_join |> 
  mutate(latitude.x = 
           as.numeric(str_trim(str_extract(latitude.x, "^\\d*\\s"))) + 
           (as.numeric(str_trim(str_extract(latitude.x, "\\s.*$"))) / 60)
  ) |> 
   mutate(longitude.x = 
           as.numeric(str_trim(str_extract(longitude.x, "^\\d*\\s"))) + 
           (as.numeric(str_trim(str_extract(longitude.x, "\\s.*$"))) / 60)
  ) 

##SF OBJECT BASED ON BC_FIRE_DATA (ALREADY HAS SF COLUMN)
fire_sf <- st_as_sf(corrob_join) |> 
  filter(!is.na(latitude.x)) |> 
  filter(!is.na(longitude.x)) |> 
  filter(!is.na(latitude.y)) |> 
  filter(!is.na(longitude.y)) |> 
  select(-latitude.x, -longitude.x) |> 
  rename(longitude = longitude.y, latitude = latitude.y) |> 
  arrange(key)

##SF OBJECT BASED ON CORROBORATING DATA, CREATE SF COLUMN
corrob_sf <- st_as_sf(corrob_join) |> 
  filter(!is.na(latitude.x)) |> 
  filter(!is.na(longitude.x)) |> 
  filter(!is.na(latitude.y)) |> 
  filter(!is.na(longitude.y)) |> 
  select(-latitude.y, -longitude.y, -geometry) |> 
  rename(longitude = longitude.x, latitude = latitude.x) |> 
  mutate(longitude = -longitude) |> 
  st_drop_geometry() |> 
  st_as_sf(coords = c("longitude", "latitude"), crs = st_crs(fire_sf)) |> 
  arrange(key)

3.5.4.1. Geospatial proximity analysis for corroborating location data

To corroborate the latitude and longitude data, I will test the proximity of the fire and corroboratory data to each other, within a tolerance of 2 meters.


Results:

  • Point pairs with proximity > 2 meters: 9
  • Point pairs with proximity > 5 meters: 5

Action Items:

  • Add 2012 corroborating data to main data set

With only 9 (~1%) of point pairs showing a separation of more than 2 meters, the location data is broadly corroborated and can be trusted for a robust analysis. At this point I don’t have a good way to reconcile the 9 cases that differ by more than 2 meters, so I will move forward, but I will keep an eye out for any unusual data that might indicate a mislocated point.


With the corroboration process complete and the data sets largely in agreement, I feel confident in transferring the additional 2012 data from the corroboration set to my main data set and using it in the analysis phase. This will allow me to extend my analysis by 1 year.
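With a geographic (lat/long) CRS, st_distance() returns great-circle distances in metres. As a rough sanity check on the scale of those distances, the same quantity can be approximated in base R with the haversine formula (sf itself uses more precise spherical / ellipsoidal computations):

```r
## HAVERSINE APPROXIMATION: DISTANCE IN METRES BETWEEN TWO LAT/LONG POINTS
haversine_m <- function(lat1, lon1, lat2, lon2, r = 6371000) {
  to_rad <- pi / 180
  dlat <- (lat2 - lat1) * to_rad
  dlon <- (lon2 - lon1) * to_rad
  a <- sin(dlat / 2)^2 +
    cos(lat1 * to_rad) * cos(lat2 * to_rad) * sin(dlon / 2)^2
  2 * r * asin(sqrt(a))
}

## 0.0001 DEGREES OF LATITUDE IS ROUGHLY 11 METRES
d <- haversine_m(49.0000, -123.0000, 49.0001, -123.0000)
```

This also shows why a 2 metre tolerance is strict: it corresponds to agreement at roughly the fifth decimal place of latitude.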

Code
##IDENTIFY ENTRIES W/IN N METERS OF EACH OTHER
corrob_proximity <- fire_sf |> 
  ##WARNING - THIS IS BASICALLY AN UNVALIDATED JOIN, MAKE VERY SURE TABLES ALIGN PERFECTLY
  ##RETURNS IN METERS IF CRS IS GEOGRAPHIC (LAT / LONG)
  st_distance(st_geometry(corrob_sf), by_element = TRUE)

##ADD COLUMN OF PROXIMITIES TO CORROB_JOIN
##AS WITH PREVIOUS STEP, PROPER ORDER IS ASSUMED SO MAKE SURE IT IS CORRECT
corrob_join <- corrob_join |> 
  arrange(key) |> 
  mutate(proximity = corrob_proximity)

##REVIEW POINTS MORE THAN 2M APART
corrob_join |> 
  mutate(proximity = as.numeric(proximity)) |> 
  filter(proximity > 2)

##TRANSFER 2012 DATA FROM CORROBORATING DATA SET TO MAIN DATA SET
corrob_transfer <- corrob_data |> 
  ##CONVERT CORROB DATA WITH 2012 VALUES TO DECIMAL DEGREES
  mutate(latitude = 
           as.numeric(str_trim(str_extract(latitude, "^\\d*\\s"))) + 
           (as.numeric(str_trim(str_extract(latitude, "\\s.*$"))) / 60)
  ) |> 
  mutate(longitude = 
           as.numeric(str_trim(str_extract(longitude, "^\\d*\\s"))) + 
           (as.numeric(str_trim(str_extract(longitude, "\\s.*$"))) / 60)
  ) |> 
  mutate(longitude = -longitude) |> 
  filter(as.numeric(as.character(year)) == 2012) |> 
  st_as_sf(coords = c("longitude", "latitude"), crs = st_crs(fire_sf)) |> 
  mutate(key = row_number() + 77776) |> ##OFFSET NEW KEYS PAST THE EXISTING KEY RANGE TO AVOID COLLISIONS
  mutate(name = toupper(fire_number)) |> 
  mutate(fire_year = as.factor(year)) |> 
  mutate(current_size = size_ha) |> 
  mutate(cause = as.factor(NA)) |> 
  mutate(geographic_description = toupper(geographic)) |> 
  mutate(incident_name = as.character(NA)) |> 
  mutate(fire_of_note_ind = NA) |> 
  mutate(response_type_desc = as.character(NA)) |> 
  mutate(was_fire_of_note_ind = NA) |> 
  mutate(ignition_yr = as.factor(NA)) |> 
  mutate(ignition_mo = as.factor(NA)) |> 
  mutate(ignition_day = NA) |> 
  mutate(ignition_hr = NA) |> 
  mutate(ignition_min = NA) |> 
  mutate(ignition_sec = NA) |> 
  mutate(ignition_am_pm = as.factor(NA)) |> 
  mutate(drop_flag = FALSE) |> 
  select(key, name, fire_year,
         current_size, cause, geographic_description, 
         incident_name, fire_of_note_ind, response_type_desc, 
         was_fire_of_note_ind, ignition_yr, ignition_mo, 
         ignition_day, ignition_hr, ignition_min, ignition_sec, 
         ignition_am_pm, drop_flag)

bc_fire_data <- bc_fire_data |> 
  bind_rows(corrob_transfer)

#rm(corrob_data, corrob_join, corrob_sf, corrob_transfer)

4. Analysis and visualization

Having cleaned and corroborated the data, I’m ready to undertake an analysis of the fires recorded in British Columbia between 2012 and 2024. This analysis will endeavor to answer two main questions.

  1. How has the size, frequency, and distribution of wildfires changed from 2012-2024?
  2. How does the cause of the fire relate to the size, frequency and distribution of the wildfires?

I will create a set of plots to answer each of these questions.

4.1. Preparing map elements

To get started with the analysis and visualization process I will need to download polygons representing British Columbia, as well as the districts within British Columbia. I will then use a spatial join to select only the district borders within British Columbia.
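The filtering logic relies on st_within, which tests whether one geometry falls entirely inside another. A tiny self-contained sf sketch (synthetic square polygons, not real boundaries):

```r
library(sf)

## OUTER 4x4 SQUARE AND AN INNER 1x1 SQUARE FULLY INSIDE IT
outer <- st_sfc(st_polygon(list(rbind(c(0, 0), c(4, 0), c(4, 4), c(0, 4), c(0, 0)))))
inner <- st_sfc(st_polygon(list(rbind(c(1, 1), c(2, 1), c(2, 2), c(1, 2), c(1, 1)))))

res <- st_within(inner, outer, sparse = FALSE)
res[1, 1]  ## TRUE
```

st_join(..., join = st_within) applies this same predicate pairwise, attaching province attributes only to districts that lie inside the provincial polygon.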

Code
##BRITISH COLUMBIA BOUNDING BOX COORDINATES
bc_bounding_box = c(
  left = -139.06, 
  bottom = 48.30, 
  right = -114.03, 
  top = 60.00)

##OSM QUERY FOR B.C. BORDER
bc_border <- opq(bc_bounding_box, timeout = 70) |>
  add_osm_feature(key = "boundary", value = "administrative") |> 
  add_osm_feature(key = "admin_level", value = "4") |> 
  osmdata_sf() |> 
  pluck("osm_multipolygons") |> 
  st_make_valid() |> 
  filter(name == "British Columbia") |> 
  select(name) |> 
  rename(province = name)

##OSM QUERY FOR B.C. DISTRICTS
bc_districts <- opq(bc_bounding_box, timeout = 70) |>
  add_osm_feature(key = "boundary", value = "administrative") |> 
  add_osm_feature(key = "admin_level", value = "6") |> 
  osmdata_sf() |> 
  pluck("osm_multipolygons") |> 
  st_make_valid() |> 
  rename(district = name)

##SPATIAL JOIN THE TABLES TO SELECT ONLY DISTRICTS IN BC
bc_districts <- st_join(bc_districts, bc_border, join = st_within, left = TRUE) |> 
  filter(province == "British Columbia")

4.2. Analysis of changes in wildfire size from 2013 to 2024

Wildfires appear to be a growing concern in British Columbia. Anecdotally, it seems that wildfires are larger, more frequent, longer lasting, and more impactful. In this analysis I will dig deeper into the nature of the wildfires in British Columbia over the 13-year study period.

I will begin by examining the distribution of wildfire size over the years to look for patterns of growth or decline in wildfire size.


To prepare my data for plotting I will break it into quartiles across all years. Then I will group it by year, to examine how many fires of each quartile fall into each year grouping. This should accentuate any patterns in fire distribution among the quartiles over time.

It is worth noting that the data sets did not contain any fire size data for 2023. I have included an empty bar in the plot and annotated this discrepancy so viewers are aware. Data for 2012 was dropped as well: since it only includes data for larger fires, its distribution is not comparable to other years. Because 2012 sits at the end of the scale, it was simply dropped, without representation in the plot. Lastly, a current_size value of .009 seems to have been used as something of a default value, particularly in earlier years. This was drastically skewing the distribution, to the point that values of .009 were still appearing in the 2nd quartile. As such, I’ve dropped all cases with a current_size less than or equal to .009, so the data has a more comparable distribution across the years.
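The binning relies on dplyr's ntile(), which ranks the values and splits them into n groups of as-equal-as-possible size. A quick illustration with hypothetical fire sizes:

```r
library(dplyr)

sizes <- c(0.5, 1.2, 3.0, 7.5, 15.0, 40.0, 120.0, 900.0)
q <- ntile(sizes, 4)
q  ## 1 1 2 2 3 3 4 4 - GROUP 1 HOLDS THE SMALLEST VALUES, GROUP 4 THE LARGEST
```

When the number of values is not divisible by n, ntile() puts the larger groups first, so group sizes differ by at most one.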

Code
##PREPARE OVERALL DATA
plot_1_data <- bc_fire_data |> 
  filter(drop_flag == FALSE) |> 
  st_drop_geometry() |> 
  select(fire_year, current_size) |> 
  filter(!is.na(current_size)) |> 
  filter(current_size > .009) |> 
  filter(fire_year != 2012) |> 
  mutate(quartile = ntile(current_size, 4)) |> 
  mutate(fire_year = factor(fire_year, levels = c("2013", "2014", "2015", "2016", 
                                                  "2017", "2018", "2019", "2020", "2021", 
                                                  "2022", "2023", "2024"))) |> 
  arrange(as.numeric(as.character(fire_year)))

##CALCULATE TOTAL FIRES BY YEAR AND QUARTILE
plot_1_yr_q_totals <- plot_1_data |> 
  group_by(fire_year, quartile) |> 
  summarise(yr_q_count = n()) |> 
  ungroup()

##CALCULATE TOTAL FIRES BY YEAR
plot_1_yr_totals <- plot_1_data |> 
  group_by(fire_year) |> 
  summarise(yr_count = n()) |> 
  ungroup()

##JOIN FIRES BY YEAR TO FIRES BY YEAR/QUARTILE AND CALCULATE % BY QUARTILE
plot_1_yr_q_totals <- plot_1_yr_q_totals |> 
  left_join(plot_1_yr_totals, by = join_by(fire_year)) |> 
  mutate(yr_q_pct = yr_q_count / yr_count)

##PLOT
ggplot(plot_1_yr_q_totals, 
       aes(fill = factor(quartile), x = fire_year, y = yr_q_pct)) + 
  geom_bar(position="fill", stat = "identity") + 
  scale_x_discrete(drop = FALSE) + 
  theme_minimal() + 
  scale_fill_manual(values = c("firebrick4", "firebrick2", "orange1", "lightgoldenrod1"), 
                    labels = c("Largest", "", "", "Smallest")) + 
  labs(
  x = "",
  y = "", 
  title = "Are wildfires in British Columbia getting bigger?  Maybe not...",
  subtitle = "By splitting all available wildfire data into 4 groups based on size, and then separating them out by year,\nwe can look for patterns in the distribution of fires among the groups over time. Outside of a steady\nincrease in the number of larger fires between 2018 and 2020, no clear patterns emerged.\n\nFire size data for 2023 not available.", 
  caption = "\nData from Government of British Columbia fire tracking, 2012 - 2024.\nhttps://www2.gov.bc.ca/gov/content/safety/wildfire-status/about-bcws/wildfire-statistics\nhttps://www.for.gov.bc.ca/ftp/HPR/external/!publish/Maps_and_Data/GoogleEarth/WMB_Fires/", 
  fill = "Fire Size"
   ) + 
  theme_minimal() + 
  theme(axis.text.y = element_blank(), 
        panel.grid.minor = element_blank(), 
        panel.grid.major.y = element_line(color = "gray80"), 
        panel.grid.major.x = element_blank(), 
        plot.caption = element_text(size = 7, hjust = 0), 
        legend.title = element_text(size = 10), 
        axis.title.x = element_text(size = 14), 
        axis.text.x = element_text(size = 12), 
        plot.title = element_text(size = 18))

An analysis of the distribution of wildfire sizes across study years shows little in the way of patterns. From 2018 to 2020 there does seem to be a steady increase in the number of large fires, but even that might be attributable to a recovery from an elevated proportion of small fires in 2018. In the end there is little here to support the idea that the size of fires in British Columbia has been changing over time.
Code
##REMOVE PLOT SPECIFIC TABLES
rm(plot_1_data, plot_1_yr_q_totals, plot_1_yr_totals)

4.3. Investigating the relationship between cause and size

Even though we found no clear patterns in the change in wildfire size over time, examining the relationship between fire cause and fire size might help us better understand the factors that influence fire size. To do this we’ll add fire cause to our previous analysis and look for patterns in the cause across size and year.


To preserve the context of the data as we examine the added factor of cause, I will use the same plot model and the same base data preparation constraints as in the previous plot. However, to capture cause effectively, I will split the stacked plot into four separate faceted plots, one for each size category. Within each of those plots the distribution of cause will be captured as a gradient fill. Years without cause data will use a grey fill to accentuate the lack of data in those years, while still preserving continuity with the previous plot.

Code
##PREPARE OVERALL DATA
plot_2_data <- bc_fire_data |> 
  filter(drop_flag == FALSE) |> 
  st_drop_geometry() |> 
  select(fire_year, current_size, cause) |> 
  filter(!is.na(current_size)) |> 
  filter(current_size > .009) |> 
  filter(fire_year != 2012) |> 
  mutate(quartile = ntile(current_size, 4)) |> 
  mutate(fire_year = factor(fire_year, levels = c("2013", "2014", "2015", "2016", 
                                                  "2017", "2018", "2019", "2020", "2021", 
                                                  "2022", "2023", "2024"))) |> 
  arrange(as.numeric(as.character(fire_year)))

##OVERALL BAR - COUNT OF CASES IN QUARTILE PER YEAR / COUNT OF CASES PER YEAR
plot_2_yr_totals <- plot_2_data |>
  group_by(fire_year) |>
  summarise(yr_count = n()) |>
  ungroup()

plot_2_yr_q_totals <- plot_2_data |> 
  group_by(fire_year, quartile) |> 
  summarise(yr_q_count = n()) |> 
  ungroup()

plot_2_yr_q_totals <- plot_2_yr_q_totals |>
  left_join(plot_2_yr_totals, by = join_by(fire_year)) |>
  mutate(yr_q_pct = yr_q_count / yr_count)

##CAUSE BAR - COUNT OF CASES WITH CAUSE DATA / COUNT OF CASES PER YEAR
plot_2_yr_q_cause_totals <- plot_2_data |> 
  select(fire_year, quartile, cause) |> 
  filter(!is.na(cause)) |> 
  group_by(fire_year, quartile) |> 
  summarise(yr_q_cause_count = n()) |> 
  ungroup() |> 
  left_join(plot_2_yr_totals, by = join_by(fire_year)) |> 
  mutate(yr_q_cause_pct = yr_q_cause_count / yr_count)


##CAUSE FILL - COUNT OF CASES CAUSED BY HUMANS / COUNT OF CASES WITH CAUSE DATA
plot_2_yr_q_human_pct <- plot_2_data |> 
  select(fire_year, quartile, cause) |> 
  filter(!is.na(cause)) |> 
  group_by(fire_year, quartile, cause) |> 
  summarise(yr_q_cause_human_count = n()) |> 
  ungroup() |> 
  filter(cause == "HUMAN") |> 
  left_join(plot_2_yr_q_cause_totals, by = join_by(fire_year, quartile)) |> 
  mutate(yr_q_human_pct = yr_q_cause_human_count / yr_q_cause_count) |> 
  select(-yr_count)

##JOIN HUMAN PCT DATA TO TOTALS AND ADD 2023 AT 0
plot_2_yr_q_totals <- plot_2_yr_q_totals |> 
  left_join(plot_2_yr_q_human_pct, by = join_by(fire_year, quartile)) |> 
  add_row(fire_year = "2023", quartile = 1, yr_q_count = 0, yr_count = 0, yr_q_pct = 0,
          cause = "HUMAN", yr_q_cause_human_count = 0, yr_q_cause_count = 0, 
          yr_q_cause_pct = 0, yr_q_human_pct = 0) |>
  add_row(fire_year = "2023", quartile = 2, yr_q_count = 0, yr_count = 0, yr_q_pct = 0,
          cause = NA, yr_q_cause_human_count = NA, yr_q_cause_count = NA, 
          yr_q_cause_pct = NA, yr_q_human_pct = NA) |>
  add_row(fire_year = "2023", quartile = 3, yr_q_count = 0, yr_count = 0, yr_q_pct = 0,
          cause = NA, yr_q_cause_human_count = NA, yr_q_cause_count = NA, 
          yr_q_cause_pct = NA, yr_q_human_pct = NA) |>
  add_row(fire_year = "2023", quartile = 4, yr_q_count = 0, yr_count = 0, yr_q_pct = 0,
          cause = NA, yr_q_cause_human_count = NA, yr_q_cause_count = NA, 
          yr_q_cause_pct = NA, yr_q_human_pct = NA) |> 
  mutate(yr_q_cause_pct = case_when(
    is.na(yr_q_cause_pct) ~ 0, 
    .default = yr_q_cause_pct
  ))

##PLOT
ggplot(data =plot_2_yr_q_totals, mapping = aes(x = factor(fire_year))) + 
  geom_bar(aes(y = yr_q_pct), stat = "identity", fill = "gray", alpha = .5) + 
  geom_bar(aes(y = yr_q_cause_pct, fill = yr_q_human_pct), stat = "identity") + 
  scale_x_discrete(drop = FALSE) + 
  scale_fill_gradient(low = "yellow1", high = "darkblue", 
                      breaks = c(0.02, .67), 
                      labels = c("Lightning", "Human")) + 
  theme_minimal() + 
  facet_wrap(~ factor(quartile), nrow = 4, labeller = as_labeller(c("1" = "Largest Wildfires", "2" = "Substantial Wildfires", "3" = "Moderately Sized Wildfires", "4" = "Smallest Wildfires"))) + 
  labs(
  x = "",
  y = "", 
  title = "The largest fires are more often caused by lightning",
  subtitle = "As the size of fires increases, so does the percentage of those fires attributed to lightning.\n\nFire size data for 2023 not available.\nFire cause data for 2018-2024 not available.", 
  caption = "\nData from Government of British Columbia fire tracking, 2012 - 2024.\nhttps://www2.gov.bc.ca/gov/content/safety/wildfire-status/about-bcws/wildfire-statistics\nhttps://www.for.gov.bc.ca/ftp/HPR/external/!publish/Maps_and_Data/GoogleEarth/WMB_Fires/", 
  fill = "Cause (%)"
   ) + 
  theme_minimal() + 
  theme(axis.text.y = element_blank(),
        panel.grid.minor = element_blank(), 
        panel.grid.major.y = element_line(color = "gray80"), 
        panel.grid.major.x = element_blank(), 
        plot.caption = element_text(size = 7, hjust = 0), 
        legend.title = element_text(size = 10), 
        axis.title.x = element_text(size = 14), 
        axis.text.x = element_text(size = 12), 
        plot.title = element_text(size = 18))

While there is no discernible pattern in cause across the years of available data, there is a clear pattern in the distribution of cause across fire size. The percentage of fires attributed to lightning increases with fire size, peaking in one of the two largest size groups for every year of available data.
Code
##REMOVE PLOT SPECIFIC TABLES
rm(plot_2_data, plot_2_yr_q_cause_totals, plot_2_yr_q_human_pct, plot_2_yr_q_totals, plot_2_yr_totals)

4.4. Looking to frequency as a measure of change in the wildfires of British Columbia

At this point we have found little evidence to support the idea that wildfire sizes in British Columbia changed significantly within the 12-year study window. But there are other ways that the impact of wildfires can change over time. One of those is the frequency of fires.

In the next section I will look for changes in the frequency of wildfires in each of the districts in British Columbia.


I will prepare my map data by dropping 2012, since it is not a full year of data and would skew the percent change between 2012 and 2013. Then I will use a spatial join to attach each fire to its district. Next, I will drop the geometry column, calculate the fire frequency per district per year, and offset that table by one year. I will then join those offset values back to the main map data to calculate the year-over-year percent change. Once that is calculated, I will load the districts onto the map, along with graphical elements like a basemap, compass, and scale bar, basing the district-level fill on the average percent change in fire frequency.
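The offset-and-join step can be sketched with a toy per-district frequency table (hypothetical counts): shifting each year forward by one lines every year up with its prior-year count.

```r
library(dplyr)

counts <- tibble(
  district  = c("A", "A", "A"),
  fire_year = c(2013, 2014, 2015),
  n         = c(10, 15, 12)
)

## SHIFT YEARS FORWARD BY 1 SO 2013'S COUNT BECOMES 2014'S "PRIOR YEAR" VALUE
offset <- counts |>
  mutate(shift_year = fire_year + 1) |>
  select(district, shift_year, prior_year_n = n)

yoy <- counts |>
  left_join(offset, by = join_by(fire_year == shift_year, district)) |>
  filter(!is.na(prior_year_n)) |>
  mutate(pct_change = (n - prior_year_n) / prior_year_n * 100)

yoy$pct_change  ## 50 (FOR 2014) AND -20 (FOR 2015)
```

Within a single table, lag() grouped by district would achieve the same thing; the explicit offset join used in the main pipeline makes the alignment easier to inspect.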

Code
##PREPARE OVERALL DATA WITH FREQUENCY
map_3_fire_data <- bc_fire_data |> 
  filter(drop_flag == FALSE) |> 
  filter(as.numeric(as.character(fire_year)) > 2012) |> ##NOT FULL DATA, PCT CHANGE WILL BE INVALID - DROP
  mutate(fire_year = as.numeric(as.character(fire_year))) |> 
  st_join(bc_districts, join = st_within, left = TRUE) |> 
  filter(!is.na(district)) |>  #DROP FIRES THAT DID NOT FALL WITHIN A DISTRICT
  st_drop_geometry() |> 
  group_by(district, fire_year) |> 
  summarise(n = n()) |> 
  ungroup()

##PREPARE OFFSET DATA FOR % CHANGE CALCULATION
map_3_fire_data_1yr_offset <- map_3_fire_data |> 
  mutate(shift_year = fire_year + 1) |> 
  select(district, shift_year, n) |> 
  rename(prior_year_n = n)

##JOIN OFFSET DATA, CALCULATE % CHANGE BY YEAR / DISTRICT, AND AVERAGE BY DISTRICT
map_3_pct_change_avg <- map_3_fire_data |> 
  left_join(map_3_fire_data_1yr_offset, by = join_by(fire_year == shift_year, district == district)) |> 
  filter(!is.na(prior_year_n)) |> 
  mutate(pct_change = ((n - prior_year_n) / prior_year_n) * 100) |> 
  group_by(district) |> 
  summarise(avg_pct_change = mean(pct_change, na.rm = TRUE))
  
##JOIN % CHANGE TO DISTRICTS AND DRAW MAP
bc_districts |> 
  left_join(map_3_pct_change_avg, by = join_by(district)) |> 
  tm_shape() + 
  tm_fill("avg_pct_change", 
          fill.scale = tm_scale_continuous(values = "YlOrRd", label.format = tm_label_format(suffix = "%")), 
          fill.legend = tm_legend(title = "Change in\nfires per year\n(Average)\n", frame = FALSE)
          ) + 
  tm_borders() + 
  tm_basemap("Esri.WorldGrayCanvas") + 
  tm_scale_bar(position = c("left", "bottom")) + 
  tm_compass(position = c("left", "top")) + 
  tm_title_out("\nData from Government of British Columbia fire tracking, 2013 - 2024.\nhttps://www2.gov.bc.ca/gov/content/safety/wildfire-status/about-bcws/wildfire-statistics\nhttps://www.for.gov.bc.ca/ftp/HPR/external/!publish/Maps_and_Data/GoogleEarth/WMB_Fires/", 
               position = tm_pos_out("center", "bottom"), 
               size = .6) + 
  tm_layout(
    main.title = "Changes in wildfire frequency across British Columbia, 2013 to 2024", 
    frame = TRUE
  )

The plot shows that wildfires have been occurring more frequently throughout British Columbia. This could be due in part to changes in data collection methodology over the years, but the general pattern of growth in fire frequency remains. Districts in northern B.C. show particularly rapid growth in the regularity of wildfires. Again, consideration must be given to the effect a small number of total fires might have on percentage growth, but the year over year increase in fire frequency is evident.
Code
rm(map_3_fire_data, map_3_fire_data_1yr_offset, map_3_pct_change_avg)

4.5. A look at the interaction between fire frequency and cause

We have now identified a meaningful pattern of growth in wildfire frequency over the study years. It is reasonable, then, to wonder whether a relationship similar to the one we identified between fire cause and size exists between fire cause and frequency.

To investigate this question, I will split the data into groups by cause, then plot year on the x-axis and frequency on the y-axis, looking for increases over time within each cause group. I will exclude 2012, since the 2012 data is already subsetted and does not represent a full year's frequency.

Code
##PREPARE OVERALL DATA
plot_4_data <- bc_fire_data |> 
  filter(drop_flag == FALSE) |> 
  st_drop_geometry() |> 
  select(fire_year, cause) |> 
  filter(!is.na(cause)) |> 
  filter(fire_year != 2012) |> 
  mutate(fire_year = factor(fire_year, levels = c("2013", "2014", "2015", "2016", 
                                                  "2017", "2018", "2019", "2020", "2021", 
                                                  "2022", "2023", "2024"))) |> 
  arrange(as.numeric(as.character(fire_year)))

##CALCULATE TOTAL FIRES BY YEAR AND CAUSE
plot_4_yr_cause_totals <- plot_4_data |> 
  group_by(fire_year, cause) |> 
  summarise(yr_cause_count = n()) |> 
  ungroup()

##PLOT
ggplot(plot_4_yr_cause_totals, 
       aes(x = fire_year, y = yr_cause_count, color = cause, group = cause, fill = cause)) + 
  geom_line(linewidth = 1.5) + 
  geom_point(size = 3, color = "black", shape = 21) + 
  scale_fill_manual(values = c("HUMAN" = "mediumblue", "LIGHTNING" = "gold1"), 
                    labels = c("HUMAN" = "Human", "LIGHTNING" = "Lightning"), 
                    name = "Cause") + 
  scale_color_manual(values = c("HUMAN" = "mediumblue", "LIGHTNING" = "gold1"), 
                     labels = c("HUMAN" = "Human", "LIGHTNING" = "Lightning"), 
                     name = "Cause") + 
  labs(
  x = "",
  y = "Number of Fires", 
  title = "Lightning as a key driver in wildfire frequency",
  subtitle = "While the number of fires caused by humans holds steady from year to year, the number caused by lightning strikes varies wildly.\n\nFire cause data for 2018 - 2024 not available.", 
  caption = "\nData from Government of British Columbia fire tracking, 2013 - 2024.\nhttps://www2.gov.bc.ca/gov/content/safety/wildfire-status/about-bcws/wildfire-statistics\nhttps://www.for.gov.bc.ca/ftp/HPR/external/!publish/Maps_and_Data/GoogleEarth/WMB_Fires/", 
   ) + 
  theme_gray() + 
  theme(panel.grid.minor = element_blank(), 
      panel.grid.major.y = element_line(color = "gray80"), 
      panel.grid.major.x = element_line(color = "gray80"), 
      plot.caption = element_text(size = 7, hjust = 0), 
      legend.title = element_text(size = 10), 
      axis.title.x = element_text(size = 14), 
      axis.text.x = element_text(size = 12), 
      plot.title = element_text(size = 18))

It is clear that lightning strikes are the key driver of overall wildfire frequency, at least in the limited span of years where wildfire cause data was available. This insight highlights the critical downstream impacts that climate change and the associated meteorological shifts can have.
Code
##REMOVE PLOT SPECIFIC TABLES
rm(plot_4_data, plot_4_yr_cause_totals)

4.6. Looking for patterns in wildfire distribution by year and cause

To assess the overall distribution of wildfires throughout British Columbia, I will create an interactive map that includes layers for each of the study years, as well as layers for each of the causes examined in previous stages. This map allows for a large number of comparative views, letting the viewer explore the data in many different ways to look for patterns across a variety of groupings and scales.


I will use leaflet to create an interactive map of British Columbia that contains a number of layers for year and cause. I will define the core map elements and parameters, and then create each of the year-based layers using a for loop, including data popups. Once that is complete I will add the cause-based layers, also with popups, and round out the map by adding a series of controls for interactivity and design.

Code
##CREATE COLOR PALETTE FOR YEAR LAYERS
set.seed(99)
map_5_year_colors <- rainbow(13)
map_5_year_colors <- sample(map_5_year_colors)

##DEFINE BASE MAP
map_5 <- leaflet() |> 
  addTiles() |> 
  addPolygons(data = bc_border, 
              fillColor = "white", 
              color = "black", 
              weight = 1, 
              fillOpacity = 0)

map_5_groups <- bc_fire_data |> 
  distinct(fire_year)

##LOOP THROUGH YEARS TO ADD YEAR LAYERS
i <- 1
for(grp in map_5_groups$fire_year) {
  map_group <- as.character(grp)

  map_5_group_data <- bc_fire_data |> 
    filter(fire_year == map_group) |> 
    mutate(name = case_when(
      is.na(name) ~ "Unk", 
      .default = name
    )) |> 
    mutate(fire_year = case_when(
      is.na(fire_year) ~ "Unk", 
      .default = fire_year
    )) |> 
    mutate(current_size = case_when(
      is.na(current_size) ~ "Unk", 
      .default = paste0(current_size, " ha")
    )) |> 
    mutate(geographic_description = case_when(
      is.na(geographic_description) ~ "None", 
      .default = geographic_description
    ))

  map_5 <- map_5 |> 
    addCircles(
      data = map_5_group_data, 
      popup = ~paste0("Name: ", name, "<br>",
                      "Year: ", fire_year, "<br>",
                      "Size: ", current_size, "<br>",
                      "Desc: ", geographic_description),
      color = map_5_year_colors[i], 
      group = map_group
    )
  i <- i + 1
}

##ADD HUMAN CAUSE LAYER
map_5_human <- bc_fire_data |> 
  filter(cause == "HUMAN")

map_5 <- map_5 |> 
  addCircles(
    data = map_5_human, 
    popup = ~paste0("Name: ", name, "<br>",
                "Year: ", fire_year, "<br>",
                "Size: ", current_size, "<br>",
                "Desc: ", geographic_description),
    group = "Caused by human", 
    color = "mediumblue")

##ADD LIGHTNING CAUSE LAYER
map_5_lightning <- bc_fire_data |> 
  filter(cause == "LIGHTNING")

map_5 <- map_5 |> 
  addCircles(
    data = map_5_lightning, 
    popup = ~paste0("Name: ", name, "<br>",
                    "Year: ", fire_year, "<br>",
                    "Size: ", current_size, "<br>",
                    "Desc: ", geographic_description),
    group = "Caused by lightning", 
    color = "gold")

##ADD LAYER SELECTION CONTROL
map_5 <- map_5 |> 
  addLayersControl(
   overlayGroups = c(as.character(map_5_groups$fire_year), "Caused by human", "Caused by lightning"), 
   options = layersControlOptions(collapsed = TRUE)
   ) |> 
  hideGroup(c(as.character(map_5_groups$fire_year), "Caused by human", "Caused by lightning")) |> 
  addScaleBar(position = "bottomleft", options = scaleBarOptions(metric = TRUE, imperial = FALSE)) |> 
  addMiniMap(position = "bottomleft", 
             width = 120, 
             height = 170, 
             toggleDisplay = TRUE) |> 
  addControl('<p style = "line-height: 1;">
              <span style="font-size: 14px;">
              <b>
              Distribution of wildfires in British Columbia, 2012-2024
              </b>
              </span>
              <br>
              <span style="font-size: 12px;">
              Including wildfire distribution by cause
              </span>', 
             position = "bottomright")


##DISPLAY MAP
map_5

Data from Government of British Columbia fire tracking, 2012 - 2024. https://www2.gov.bc.ca/gov/content/safety/wildfire-status/about-bcws/wildfire-statistics https://www.for.gov.bc.ca/ftp/HPR/external/!publish/Maps_and_Data/GoogleEarth/WMB_Fires/

Analysing the available data using this interactive map shows that wildfires are more concentrated in southern and western British Columbia. There is however no clear change in the distribution pattern over time. It is also noteworthy that the distribution of wildfires started by humans is concentrated along roadways and population centers, while fires started by lightning are more broadly distributed.

5. Wrap-up

By considering a variety of factors using both spatial and temporal analysis techniques, we have started to gain a better understanding of what does and does not impact wildfires in British Columbia. The patterns we've seen, in particular the impact of lightning on the severity of seasonal wildfires, can be used to guide decision making, better direct resources, and allow for more effective wildfire management.

Footnotes

  1. https://www2.gov.bc.ca/gov/content/safety/wildfire-status/about-bcws/wildfire-statistics↩︎

  2. https://www.for.gov.bc.ca/ftp/HPR/external/!publish/Maps_and_Data/GoogleEarth/WMB_Fires/↩︎

  3. https://ciffc.net/pdfs/stages-of-control-and-response.pdf↩︎

  4. https://stackoverflow.com/questions/26886393/converting-ddm-to-dec-decimal-degrees↩︎